/  AD-AQ9d  225  YALE  UNIV  NEW  HAVEN  CT  DEPT  OF  COMPUTER  SCIENCE  F/6  12/1 

ON  SOME  TRENDS  IN  ELLIPTIC  PROBLEM  SOLVERS. (U) 

FEB  81  SC  EISENSTAT »  M  H  SCHULTZ  N00014-76-C-0277 

UNCLASSIFIED  TR-194/81  NL 


YALE  UNIVERSITY 

DEPARTMENT  OF  COMPUTER  SCIENCE 


81  4  6  202 


/ 


r- 


DT!C 


ON  SOME  TRENDS  IN  ELLIPTIC  PROBLEM  SOLVERS# 

_.0 — ~ — T -  *  —  1„  * 

:  s  c.  Aisenstat  mm  m.  h.  tScbultz 


7  cx 


I  (  V 


V  I* 


TECHNICAL  REPORT  #194/81 

fit  niKUR  081  | 


/ 


7: 


/  N 


^£7 


/>  rrt-^yy/s.S 


This  vork  was  supported  in  part  by  the  0. S.  Office  of  Navel  Research 
under  Grant  'fr0tfol4-76-C-0277^  ^ 

Q?  s/f  ''  : 

Apprr.-.  ,f  f  r 

LV- 


"’ll  .30; 


1.  Introduction 


elliptic  boundary  value  problem*  are  at  the  core  of  aany 
systems  of  partial  differential  equation*  occurring  in  aechanica. 

Example*  of  application*  include  fluid  dynamic*,  semiconductor 
device  modelling,  and  atructural  analyai*.  -In  addition,  they  often 
appear  a*  a  result  of  applying  an  implicit  difference  approximation 
to  the  time  derivative  in  a  parabolic  problem.  >  Thus,  it  is 
important  to  have  efficient  and  robust  elliptic  problem  solvers. 

Moreover,  ve  expect  that  many  of  the  issues  faced  in  the  development 
of  such  solvers  are  typical  of  those  to  be  faced  in  the  development 
of  large-scale  scientific  problem  solvers. 

The  development  of  high-technology  elliptic  problem  solvers 
must  include  serious  consideration  of  algorithms,  architecture,  and 
applied  mathematics,  each  traditionally  the  subject  of  an  entire 
discipline.  Recent  and  expected  advances  in  algorithm  and  hardware 
technologies  strongly  suggest  the  possibility  of  both  dramatically  j  ]>|  ■.{ r; 
reducing  the  cost  and  dramatically  increasing  the  scope  of  1 

scientific  computational  modelling.  But  an  integrated  highly 
interdisciplinary  approach  will  be  necessary  to  realise  this 
possibility. 

Success  in  the  development  of  high-technology  elliptic  problem 
solvers  will  significantly  increase  the  effectiveness  and 
productivity  of  engineers  and  scientists  by  decreasing  the  cost  of 
their  computing,  by  increasing  the  efficiency  with  which  they  can 
use  the  computer,  and  by  allowing  them  to  solve  increasingly  more 
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complex  problems,  which  are  computationally  intractable  on  even  the 
fastest  machines  available  today. 

>  In  this  paper  ve  discuss  some  of  the  issues  involved  in  the 
design  of  a  high-technology  elliptic  problem  solver.  In  particular, 
we  will  concentrate  our  attention  on  the  design  of  a  modular, 
heterogeneous  multi-processor  elliptic  problem  solver  consisting  of 
a  host  computer  and  one  or  more  peripheral  processors.  ^ - 

We  wish  to  acknowledge  the  help  of  our  current  and  former 
students  and  colleagues  in  the  Department  of  Computer  Science  of 
Tale  University  in  developing  the  ideas  presented  here.  In 
particular,  our  colleague  Josh  Fisher  played  a  key  role  in 
formulating  the  ideas  in  the  Architecture  section  of  this  paper. 

2.  Architecture 

We  propose  a  high-technology  elliptic  problem  solver 
consisting  of  a  heterogeneous  multi-processor  computer  system.  This 
system  would  have  a  host  machine  (such  as  a  minicomputer  with  a 
32-bit  word  length  and  an  operating  system  supporting  virtual 
memory)  and  one  or  more  highly  optimised  peripheral  processors.  An 
important  practical  advantage  to  this  type  of  architecture  is  the 
fact  that  one  does  not  have  to  design  a  complete  hardware-software 
system  to  gain  great  computational  power. 

Commercial  high-performance,  programmable  peripheral 
processors,  which  we  will  refer  to  hereafter  as  attached  processors, 
have  become  quite  popular  for  signal  processing.  These  processors 
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feature  tailored  scientific  instruction  sets,  direct  user 
microprogramming,  and  very  wide  instruction  words.  This  provides 
potentially  very  large  speed-ups,  but  makes  the  processors  extremely 
difficult  to  code  for  efficient  performance.  Clearly,  programming 
tools  such  as  optimising  FORTRAN  compilers  are  a  critical  need  if 
the  use  of  these  devices  is  to  become  commonplace.  It  seems  quite 
likely  that  for  the  next  five  years  such  attached  processors  will 
provide  a  very  attractive  architecture  for  rapid  and  cost-effective 
scientific  computation. 

The  value  of  dual-processor  systems  involving  a 
general-purpose  host  and  an  attached  processor  for  solving  elliptic 
problems  has  yet  to  be  convincingly  demonstrated.  Little  is  known 
about  many  fundamental  issues.  Perhaps  the  most  critical  is  the 
identification  of  which  key  subalgorithms  attached  processors  can  do 
sufficiently  well  to  insure  that  the  cost  of  moving  the  data 
describing  the  problem  does  not  overwhelm  the  gain  obtained. 

Indeed,  short  of  moving  the  entire  problem  to  the  attached 
processor,  which  may  not  be  feasible  because  of  programming  and 
memory  considerations,  it  is  not  a  priori  clear  that  such 
subalgorithms  exist.  If  this  multi-processor  approach  is  to  be 
successful,  the  host  machine  must  be  capable  of  supporting 
peripheral  processors  via  a  very  fast  specialised  bus,  rather  than  a 
slow  general-purpose  comranications  bus. 

In  thinking  about  the  architecture  of  an  elliptic  problem 
solver,  we  should  look  beyond  current  commercial  attached 
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processors.  VLSI-bssed  peripheral  processors,  which  provide 
extremely  fast  cycle  tine  because  of  high  density,  the  potential  for 
specification  of  highly  concurrent  calculations,  very  low  cost,  and 
easy  design  with  autoaated  design  tools,  way  turn  out  to  be 
revolutionary. 

In  combination,  these  attractions  will  allow  us  to  fabricate 
special-purpose  processors  with  the  potential  for  massively  reducing 
computation  time.  Furthermore,  they  introduce  the  possibility  of 
moving  many  of  the  complexities  of  large  scientific  software  systems 
into  hardware  and  in  this  way  drastically  reducing  the  cost  of 
scientific  programming.  The  potential  of  this  technology  is  so 
great  that  it  promises  to  revolutionize  our  concept  of  what  is  a 
computationally  tractable  problem. 

As  Kung  [6,  7]  has  emphasized,  the  key  to  cost-effective 
design  and  performance  analysis  of  VLSI  is  the  underlying  high-level 
algorithm.  In  order  to  maximize  chip  density  at  any  level  of 
fabrication  technology,  it  is  necessary  to  have  an  architecture  that 
involves  simple  and  regular  interconnections.  Hence,  underlying  the 
chip  should  be  an  efficient  parallel  algorithm  that  has  simple  and 
regular  movements  of  data.  Moreover,  to  maximize  the  throughput  of 
the  peripheral  processor,  the  slgorithm  should  use  pipe-lining 
techniques  to  overlap  I/O  with  computation. 

Despite  its  extraordinary  potential,  VLSI  design  and 
fabrication  are  still  somewhat  experimental.  The  current  design 
process  is  lengthy  and  chip  densities  of  the  type  we  need  are  still 


mil  is  the  future.  He  believe,  however,  thet  ie  sot  too  early  to 
begin  thinking  about  the  beeic  issues. 


3.  Algorithm 

As  the  peripheral  processors  and  their  algorithms  become 
computationally  more  effective,  we  run  an  increasing  risk  that  the 
whole  system  will  founder  on  communications.  The  turn-around  or 
wall-clock  time  for  a  given  simulation  will  be  bounded  below  by 
communication  time,  the  time  needed  to  move  the  data  between  the 
host  and  the  peripheral  processors. 

He  will  focus  our  attention  in  this  section  of  the  paper  on 
some  of  the  implications  algorithms  have  for  communications.  We 
believe  that  the  interplay  between  architecture  end  algorithms  for 
such  multi-processor  systems,  and  in  particular  their  impact  on  the 
turn-around  time  for  large  problems,  is  ill  understood.  Much 
theoretical  modelling  and  experimentation  remains  to  be  done.  Our 
intent  is  to  illustrate  some  of  the  issues. 

Generally  speaking,  algorithms  for  elliptic  problems  at  some 
stage  reduce  to  solving  very  large  sparse  linear  systems.  These 
linear  algebra  subproblems  have  two  distinct  computational  phases, 
the  assembly  (i.e. ,  the  computation)  of  the  linear  systems  to  be 
solved  and  the  solution  of  the  linear  systems. 

This  leads  to  many  questions.  Hhat  is  the  class  of  algorithms 
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for  assembling  linear  »ys terns?  What  are  the  implications  of  each 
assembly  algorithm  for  computation  and  communication  costs?  Vhat  is 
the  class  of  algorithms  for  solving  the  resulting  highly  structured 
sparse  linear  systems?  Vhat  are  the  implications  of  each  solution 
algorithm  for  computation  and  communication  costs?  Vhat  are  the 
bandvidths  for  data  flow  between  different  levels  of  memory  and 
processors?  How  much  fast  memory  is  needed  to  guarantee  that 
algorithms  will  be  compute-bound?  Vhich  subalgorithms  can  and 
should  be  implemented  on  attached  processors?  For  which  algorithms 
should  special-purpose  VLSI  based  processors  be  built? 

How  can  we  reduce  turn-around  time?  Despite  what  the  textbooks 
tell  us,  neither  algorithms  that  are  more  efficient  in  arithmetic 
operations  nor  faster  peripheral  processors  will  in  themselves 
automatically  reduce  turn-around  time.  The  key  is  to  do  something 
about  I/O.  In  the  remainder  of  this  section,  we  will  illustrate 
four  general  techniques  for  addressing  this  issue: 

1.  Hide  the  I/O,  i.e. ,  overlap  the  I/O  as  much  as  possible 
with  useful  computation  (under  the  assumption,  of  course, 
that  the  computing  environment  supports  user-controllable 
parallel  execution  of  I/O  and  computation). 

2.  Trade  I/O  off  for  incremental  arithmetic  computation 
until  the  two  are  completely  overlapped. 

3.  Use  linear  algebra  techniques  to  reduce  the  amount  of 
required  storage  and  hence  the  data  flow. 

A.  Use  analytic  techniques  to  reduce  the  amount  of  required 
storage  and  hence  the  data  flow  by,  for  example,  using 
adaptive  high-order  discretisations. 


In  order  to  illustrate  these  techniques,  ve  consider  the  model 
problem  of  a  second-order,  linear,  self-adjoint  elliptic  boundary 
value  problem  in  the  unit  square.  He  discuss  hov  each  technique  can 
be  used  to  improve  turn-around  time. 

For  the  first  three  techniques  ve  consider  solving  the 
standard  five-point  central-difference  approximation  to  the  model 
problem  on  a  uniform  nxn  grid  [111.  To  illustrate  hov  ve  might 
overlap  I/O  with  computation  in  a  multi-processor  system,  ve 
consider  the  case  of  a  dual-processor  system,  a  central  processing 
unit  and  an  I/O  processor,  vith  dual-ported  fast  primary  memory  and 
slow  secondary  memory,  such  as  a  disk.  He  assume  that  the  I/O 
processor  is  user-programmable  and  that  it  can  operate  in  parallel 
vith  the  CPU. 

Suppose  ve  wish  to  compute  the  Cholesky  factorization  [5]  of 
the  model  matrix,  but  that  the  mesh  is  arbitrarily  large,  i. e. ,  the 
number  of  mesh  points,  n,  in  each  direction  is  arbitrarily  large,  so 
that  the  band  of  the  matrix  cannot  be  stored  in  primary  memory. 

Hhat  can  ve  say  about  communication  costs  and  turn-around  time? 

Along  vith  Jack  Perry,  ve  formulated  and  proved  a  rather  startling 
result  [8]  about  "out-of-core"  Cholesky  factorizstion  algorithms, 
which  ve  state  informally. 

Given  the  ratio  of  the  speeds  of  the  two  processors,  there 
exists  a  constant  M  such  that  for  all  systems  vith  at  least 
M  words  of  primary  memory  there  exists  a  block  Cholesky 
factorizstion  algorithm  and  a  two-processor  schedule  so  thst 
the  factorisation  is  compute-bound  independent  of  n. 

It  is  important  to  observe  that,  in  discussing  softvsre  for 


multi-processor  systems,  one  Bust  give  not  only  a  computational 

algorithm  in  the  classical  sense  for  serial  machines  but  also  a 

specification  for  the  flow  of  data  between  memories  and  processors 

and  a  schedule  for  the  processors.  The  specific  case  under 

discussion  is  particularly  startling  in  that  using  larger  amounts  of 

primary  memory  beyond  the  minimum  M  given  by  the  preceding  result 

actually  increases  the  turn-around  time.  If  we  assume  that  the  band 

of  the  matrix  can  fit  into  primary  memory  but  is  initially  stored  in 

secondary  memory,  we  can  prove  the  following  result  [8]: 

The  turn-around  time  of  the  above  "out-of-core"  algorithm  is 
shorter  than  the  turn-around  time  required  for  (1)  reading 
the  band  into  primary  memory  and  (2)  doing  an  "in-core" 

Cholesky  factorization. 

Many  currently  available  computers  do  not  allow  us  to  control 

their  I/O  processors  explicitly,  but  do  have  a  virtual  memory  system 

using  a  heuristic  page  replacement  algorithm  such  as  LRU  (Least 

Recently  Used).  We  have  been  able  to  prove  the  following  result 

about  this  situation;  see  Perry  [8]  for  a  formal  statement  and  proof. 

Given  a  fixed  amount  of  primary  memory  and  page  size,  there 
exists  a  block  Cholesky  factorization  algorithm,  independent 
of  n,  that  minimizes  the  number  of  page  faults. 

Despite  what  the  operating-systems  experts  of  several 
commercial  vendors  claim,  and  despite  the  fact  that  automatic  paging 
may  minimize  the  number  of  times  the  data  must  be  moved,  given  the 
choice  between  using  automatic  paging  and  scheduling  the  I/O, 
scheduling  the  I/O  is  bound  to  be  more  efficient  in  turn-around 
time.  The  point  is  that  automatic  paging  schemes  do  not  necessarily 
move  the  data  at  the  right  time  to  overlap  it  with  useful 
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computation.  Such  considerations  certainly  should  be  of  major 

importance  in  designing  systems  for  scientific  computation. 

The  last  three  general  techniques  for  minimizing  turn-around 

time  are  based  on  two  observations:  (1)  Reducing  storage  is  likely 

to  reduce  I/O.  (2)  There  often  exists  a  trade-off  between 

solve-time  and  storage  (or  I/O  time). 

To  illustrate  a  trade-off  between  solve-time  and  storage,  we 

again  consider  the  model  linear  system.  The  classical  band  Cholesky 

A  3 

factorization  algorithm  requires  1/2  n4  +  0(nJ)  multiplications  and 

3 

n  storage.  However,  along  with  Andrew  Sherman  we  developed  a 
divide-and-conquer  band  elimination  or  minimal  storage  band 

elimination  algorithm  l 3 ]  that  requires  5/6  n^  +  O(n^) 

2 

multiplications  and  (n+1)  storage.  Thus  we  can  gain  an 
order-of-magnitude  improvement  in  the  required  storage  at  the 
expense  of  a  constant-factor  increase  in  the  required  work.  The 
reduction  in  primary  memory  occupancy  is  particularly  dramatic. 

The  third  general  technique  for  reducing  turn-around  is  to  use 
sophisticated  linear  algebra  methods.  This  is  a  place  where  more 
classical  kinds  of  numerical  analysis  play  a  very  important  role. 
Four  of  our  favorite  linear  algebra  methods  are  (1)  sparse 
elimination  [10],*  (2)  minimal  storage  sparse  elimination  [4];  (3) 
preconditioned  conjugate  gradient  methods  [2];  and  (4) 
multi-grid  [ll.  These  four  methods  have  the  following  asymptotic 
costs  for  the  model  problem: 
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Multinlications 

Storage 

(1)  sparse  elimination 

0(n3) 

o 

0(n  log  n) 

(2)  minimal  storage 

sparse  elimination 

0(n3) 

0(n2) 

(3)  preconditioned  conjugate 
gradient  method 

0(n2, 3  log  n) 

0(n2) 

(4)  multi-grid 

0(n2) 

0(n2) 

These  methods  have  the  following  virtues: 


( 1 )— ( 2) :  (a)  Good,  general  codes  available. 

(b)  Efficient  for  two-dimensional  problems. 

(3) :  (a)  Generally  applicable  with  promise  of  good  codes. 

(b)  Efficient  for  two-  and  three-dimensional  problems. 

(c)  Low  storage  requirement. 

(4) :  (a)  Asymptotically  optimally  efficient. 


On  the  other  hand,  these  methods  have  the  following  possible 
shortcomings : 

(l)-(2):  (a)  Inefficient  in  three  dimensions. 

(b)  Large  storage  requirements,  especially  in  three 
dimensions. 

(3) :  (a)  In  some  cases,  parameters  need  to  be  estimated. 

(4) :  (a)  Difficult  to  implement  efficiently. 

(b)  Difficult  to  implement  for  general  problems. 


What  can  ve  conclude  from  these  asymptotic  results?  Methods 


(2),  (3),  and  (4)  have  storage  proportional  to  the  number  of 
unknowns.  In  theory,  multi-grid  looks  like  the  clear  winner.  In 
practice,  however,  it  is  so  difficult  to  implement  and  has  such 
complex  logic  and  overhead  that  it  may  not  be  efficient  on  any 
existing  peripheral  processor  —  maybe  not  even  on  one  specially 
designed. 

Th?  last  but  perhaps  most  powerful  way  to  reduce  time  and 
storage  i6  analytic:  the  use  of  adaptive  high-order  finite 
difference  or  finite  element  discretization  techniques.  Because  the 
techniques  are  high-order,  for  a  given  accuracy  the  discretization 
involves  few  parameters  and  little  storage  and  so  can  be  solved 
quickly.  Because  the  techniques  are  adaptive,  we  can  attain  high 
convergence  rates  for  smooth  and  nonsmooth  problems  alike. 

As  an  example,  consider  the  use  of  piecewise  bicubic 
polynomials  with  the  Rayleigh-Ritz-Galerkin  method  [9].  Such  a 
method  requires  c  n^  +  k  n**,  2£  q£  4,  k«c  multiplications,  where 
the  first  term  corresponds  to  assembling  the  matrix  and  the  second 
term  corresponds  to  solving  it.  The  value  of  q  depends  on  the 
method  used  to  solve  the  linear  system.  In  general,  for  a  given 
fixed  accuracy,  the  linear  systems  for  high-order  discretizations 
are  much  smaller,  and  hence  have  more  "information  content"  per 
nonzero  entry,  than  the  linear  systems  for  lov-order 
discretizations.  But  the  computation  per  nonzero  to  assemble  the 
matrix  is  much  higher.  Thus,  we  may  draw  several  conclusions: 
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1.  Solving  the  linear  system  for  a  high-order  approximation 
is  relatively  easy. 

2.  I/O  is  reduced  and  is  often  unnecessary  for  high-order 
methods. 

3.  The  assembly  times  are  relatively  high. 

Clearly  ve  should  focus  on  optimizing  the  assembly  phase.  The 
first  thing  to  investigate  in  the  use  of  a  peripheral  processor  is 
hov  veil  the  assembly  phase  can  be  implemented.  What  can  be  done 
algorithmically  to  improve  the  assembly  phase?  Our  studies  with 
Alan  Weiser  [12]  indicate  that  it  is  advantageous  to  use  smooth 
tensor-product,  piecevise-polynomial  basis  functions  and  to  take 
advantage  of  all  possible  symmetries.  For  example,  ve  get  the 
folloving  assembly  costs  per  interior  element: 


Moreover,  smoothness  has  a  beneficial  effect  on  the  solve 
phase  in  terms  of  vork  and  storage.  The  folloving  table  gives  the 
relative  operation  counts  and  storage  for  the  Cholesky  factorization 
of  the  corresponding  band  matrix: 
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Smoothness 

Relative  Multiplication 
Counts 

Relative 

Storage 

C° 

364. 5 

81 

c1 

32.0 

16 

c2 

4.5 

3 

This  shows  that  symmetries  can  play  an  important  role  in  the 
assembly  phase.  If  possible,  it  is  highly  advantageous  to  map  the 
problem  domain  onto  a  union  of  rectangles  and  to  use  smooth 
tensor-product 6  of  piecewise  polynomials. 

Unfortunately,  actual  computational  experience  for  realistic 
problem  sizes  giving  engineering  accuracy  is  not  quite  as  clear-cut 
as  we  would  like.  The  adaptive  high-order  methods  have  a  very  large 
overhead  because  of  to  their  logical  complexity.  In  practice  we  are 
trading  arithmetic  operations  and  storage  for  logical  complexity. 
Moreover,  it  is  not  at  clear  whether  we  can  effectively  implement 
these  methods  on  a  multi-processor  system. 

When  it  comes  to  turn-around  time,  we  do  not  know  whether  it 
is  better  to  use  a  simple  low-order  method  that  involves  a  lot  of 
data  but  can  be  implemented  very  effectively  on  a  multi-processor  or 
a  sophisticated  adaptive  high-order  method  that  involves  little  data 
but  might  not  be  effectively  implementable.  We  cannot  yet  answer 
this  fundamental  question. 
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